Results 1 - 9 of 9
1.
ACM Web Conference 2023 - Companion of the World Wide Web Conference, WWW 2023 ; : 1204-1207, 2023.
Article in English | Scopus | ID: covidwho-20239230

ABSTRACT

Timeline summarization (TLS) is a challenging research task that requires researchers to distill extensive and intricate temporal data into a concise and easily comprehensible representation. This paper proposes a novel approach to timeline summarization using Abstract Meaning Representations (AMRs), a graphical representation of the text where the nodes are semantic concepts and the edges denote relationships between concepts. With AMR, sentences with different wordings, but similar semantics, have similar representations. To make use of this feature for timeline summarization, a two-step sentence selection method that leverages features extracted from both AMRs and the text is proposed. First, AMRs are generated for each sentence. Sentences are then filtered by removing those with no named entities and keeping those with the highest number of named entities. In the next step, sentences to appear in the timeline are selected based on two scores: the Inverse Document Frequency (IDF) of AMR nodes combined with the score obtained by applying a keyword extraction method to the text. Our experimental results on the TLS-Covid19 test collection demonstrate the potential of the proposed approach. © 2023 ACM.
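As a rough illustration of the two-step selection the abstract describes (not the authors' code: the AMR nodes and entity counts below are invented toy data standing in for parser output, and the keyword-score component is omitted):

```python
import math

# Hypothetical mini-corpus: each sentence carries pre-extracted AMR
# concept nodes and a named-entity count (a real system would obtain
# these from an AMR parser and an NER tagger).
SENTENCES = [
    {"text": "WHO declares COVID-19 a pandemic.",
     "nodes": {"declare-01", "pandemic", "org"}, "entities": 2},
    {"text": "Cases rise sharply in March.",
     "nodes": {"rise-01", "case", "sharp"}, "entities": 1},
    {"text": "It was a difficult time.",
     "nodes": {"difficult", "time"}, "entities": 0},
]

def select_for_timeline(sents, top_k=2):
    # Step 1: drop sentences with no named entities.
    pool = [s for s in sents if s["entities"] > 0]
    # IDF of each AMR node over the filtered pool.
    n = len(pool)
    df = {}
    for s in pool:
        for node in s["nodes"]:
            df[node] = df.get(node, 0) + 1
    idf = {node: math.log(n / c) + 1 for node, c in df.items()}
    # Step 2: rank sentences by the summed IDF of their AMR nodes.
    ranked = sorted(pool, key=lambda s: sum(idf[x] for x in s["nodes"]),
                    reverse=True)
    return [s["text"] for s in ranked[:top_k]]
```

In the full method this node-IDF score is combined with a text-level keyword-extraction score before ranking.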

2.
7th Arabic Natural Language Processing Workshop, WANLP 2022 held with EMNLP 2022 ; : 1-10, 2022.
Article in English | Scopus | ID: covidwho-2290872

ABSTRACT

Named Entity Recognition (NER) is a well-known problem for the natural language processing (NLP) community. It is a key component of different NLP applications, including information extraction, question answering, and information retrieval. In the literature, there are several Arabic NER datasets with different named entity tags; however, due to data and concept drift, we are always in need of new data for NER and other NLP applications. In this paper, first, we introduce Wassem, a web-based annotation platform for Arabic NLP applications. Wassem can be used to manually annotate textual data for a variety of NLP tasks: text classification, sequence classification, and word segmentation. Second, we introduce the COVID-19 Arabic Named Entities Recognition (CAraNER) dataset extracted from the Arabic Newspaper COVID-19 Corpus (AraNPCC). CAraNER has 55,389 tokens distributed over 1,278 sentences randomly extracted from Saudi Arabian newspaper articles published during 2019, 2020, and 2021. The dataset is labeled by five annotators with five named-entity tags, namely: Person, Title, Location, Organization, and Miscellaneous. The CAraNER corpus is available for download for free. We evaluate the corpus by fine-tuning four BERT-based Arabic language models on the CAraNER corpus. The best model was AraBERTv0.2-large, which achieved a macro-F1 of 0.86. © 2022 Association for Computational Linguistics.
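The macro-F1 measure the paper reports can be illustrated with a minimal sketch; the tag set below is the five CAraNER classes plus an outside tag, and the gold/predicted sequences are invented:

```python
def macro_f1(gold, pred, tags):
    """Macro-averaged F1 over per-token tags (the measure the paper reports)."""
    scores = []
    for t in tags:
        tp = sum(1 for g, p in zip(gold, pred) if g == t and p == t)
        fp = sum(1 for g, p in zip(gold, pred) if g != t and p == t)
        fn = sum(1 for g, p in zip(gold, pred) if g == t and p != t)
        if tp + fp + fn == 0:          # tag absent from gold and pred: skip
            continue
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# The five CAraNER entity tags plus "O"; token sequences are invented.
TAGS = ["Person", "Title", "Location", "Organization", "Miscellaneous", "O"]
GOLD = ["Person", "O", "Location", "O", "Organization"]
PRED = ["Person", "O", "Location", "Person", "O"]
```

Macro averaging gives each tag class equal weight, so rare classes such as Miscellaneous count as much as frequent ones.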

3.
1st International Conference on Machine Learning, Computer Systems and Security, MLCSS 2022 ; : 301-306, 2022.
Article in English | Scopus | ID: covidwho-2294226

ABSTRACT

The COVID-19 pandemic has been accompanied by such an explosive increase in media coverage and scientific publications that researchers find it difficult to keep up. We therefore work with a COVID-19 dataset on the Omicron variant to recognise named entities in a given text, collecting COVID-related data from newspapers and from tweets. This article covers named entities such as COVID variant names, organization names, location names, and vaccine names. The pipeline includes tokenisation, POS tagging, chunking, labelling, and editing. It helps to recognise, from a huge dataset, where COVID spread most (location), which variant spread most (variant name), and which vaccine has been given (vaccine name). In this work, we have identified such names. If we treat unemployment, economic downturn, death, recovery, or depression as a topic, we can also identify the topic names and the phase in which they occurred. © 2022 IEEE.
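A minimal sketch of the dictionary-lookup step such a pipeline might start from (the gazetteer entries are invented examples; the POS-tagging and chunking stages are omitted):

```python
import re

# Tiny illustrative gazetteers; a real pipeline would add POS tagging
# and chunking on top of this lookup step, and use far larger lists.
GAZETTEER = {
    "variant": {"omicron", "delta"},
    "vaccine": {"covaxin", "covishield", "pfizer"},
    "location": {"delhi", "mumbai", "london"},
}

def tag_entities(text):
    tokens = re.findall(r"\w+", text.lower())   # tokenisation
    found = []
    for tok in tokens:
        for label, words in GAZETTEER.items():
            if tok in words:
                found.append((tok, label))
    return found
```

Counting the tagged mentions over a large collection then answers questions like which variant or vaccine is mentioned most, and where.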

4.
13th International Conference on Language Resources and Evaluation Conference, LREC 2022 ; : 244-257, 2022.
Article in English | Scopus | ID: covidwho-2169133

ABSTRACT

Over the course of the COVID-19 pandemic, large volumes of biomedical information concerning this new disease have been published on social media. Some of this information can pose a real danger to people's health, particularly when false information is shared, for instance recommendations on how to treat diseases without professional medical advice. Therefore, automatic fact-checking resources and systems developed specifically for the medical domain are crucial. While existing fact-checking resources cover COVID-19-related information in news or quantify the amount of misinformation in tweets, there is no dataset providing fact-checked COVID-19-related Twitter posts with detailed annotations for biomedical entities, relations and relevant evidence. We contribute CoVERT, a fact-checked corpus of tweets with a focus on the domain of biomedicine and COVID-19-related (mis)information. The corpus consists of 300 tweets, each annotated with medical named entities and relations. We employ a novel crowdsourcing methodology to annotate all tweets with fact-checking labels and supporting evidence, which crowdworkers search for online. This methodology results in moderate inter-annotator agreement. Furthermore, we use the retrieved evidence extracts as part of a fact-checking pipeline, finding that the real-world evidence is more useful than the knowledge indirectly available in pretrained language models. © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
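Inter-annotator agreement for crowdsourced labels like these is commonly measured with Fleiss' kappa; below is a self-contained sketch with invented ratings and hypothetical label names (the actual CoVERT label set and agreement figures are not reproduced here):

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for multiple raters over nominal categories.

    ratings: one dict per item mapping category -> number of raters who
    chose it; every item must have the same total number of raters.
    """
    n = len(ratings)
    r = sum(ratings[0].values())
    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in item.values()) - r) / (r * (r - 1))
        for item in ratings
    ) / n
    # Chance agreement from the marginal category proportions.
    p_e = sum(
        (sum(item.get(cat, 0) for item in ratings) / (n * r)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)

# Invented example: 3 tweets, 3 raters, 2 hypothetical fact-check labels.
RATINGS = [{"SUPPORTS": 3}, {"SUPPORTS": 2, "REFUTES": 1}, {"REFUTES": 3}]
LABELS = ["SUPPORTS", "REFUTES"]
```

On common interpretation scales, kappa values around 0.4-0.6 are read as "moderate" agreement, the level the abstract reports.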

5.
2022 Workshop on Creating, Enriching and Using Parliamentary Corpora, ParlaCLARIN III 2022 ; : 117-124, 2022.
Article in English | Scopus | ID: covidwho-2167388

ABSTRACT

This paper describes the process of acquisition, cleaning, interpretation, coding and linguistic annotation of a collection of parliamentary debates from the Senate of the Italian Republic covering the COVID-19 pandemic emergency period and a former period for reference and comparison, according to the CLARIN ParlaMint prescriptions. The corpus contains 1199 sessions and 79,373 speeches for a total of about 31 million words, and was encoded according to the ParlaCLARIN TEI XML format. It includes extensive metadata about the speakers, sessions, political parties and parliamentary groups. As required by the ParlaMint initiative, the corpus was also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the Universal Dependencies guidelines. Named entity annotation and classification is also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus deposited and archived in the CLARIN repository together with all other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used both for research and educational purposes. © European Language Resources Association (ELRA).
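A sketch of how such a TEI-encoded corpus could be queried with the Python standard library (the fragment below uses simplified, non-namespaced element names as a stand-in, not the exact ParlaMint schema):

```python
import xml.etree.ElementTree as ET

# Minimal fragment loosely modelled on ParlaCLARIN TEI: <u> utterances
# with a @who speaker reference and <w> token elements carrying lemma
# and POS attributes. The real schema is richer and namespaced.
TEI = """
<TEI>
  <u who="#SenatorA"><w lemma="essere" pos="VERB">e</w>
    <w lemma="emergenza" pos="NOUN">emergenza</w></u>
  <u who="#SenatorB"><w lemma="votare" pos="VERB">votiamo</w></u>
</TEI>
"""

def speech_stats(xml_text):
    root = ET.fromstring(xml_text)
    speeches = root.findall(".//u")
    return {
        "speeches": len(speeches),
        "tokens": sum(len(u.findall(".//w")) for u in speeches),
        "speakers": sorted({u.get("who") for u in speeches}),
    }
```

The same pattern, applied to the full corpus, yields the session, speech and token counts the paper reports as corpus statistics.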

6.
Viruses ; 14(12)2022 12 11.
Article in English | MEDLINE | ID: covidwho-2155318

ABSTRACT

The clinical application of detecting COVID-19 factors is a challenging task. The existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical factors, non-clinical factors, such as social determinants of health (SDoH), are also important for studying infectious disease. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1-5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.
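The CRF layer of a BiLSTM-CNN-CRF tagger selects the best tag sequence by Viterbi decoding over the network's scores; here is a minimal pure-Python sketch with invented scores and tags (in the real model both are learned from data):

```python
def viterbi(emissions, transitions, tags):
    """Best-scoring tag path: the decoding step of a (Bi)LSTM-CRF layer.

    emissions: per-token dicts of tag -> score (from the network);
    transitions: dict of (prev_tag, cur_tag) -> score (from the CRF).
    """
    score = {t: emissions[0][t] for t in tags}   # scores after first token
    backptrs = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in tags:
            best = max(tags, key=lambda p: score[p] + transitions[(p, cur)])
            new_score[cur] = score[best] + transitions[(best, cur)] + em[cur]
            ptr[cur] = best
        score = new_score
        backptrs.append(ptr)
    last = max(tags, key=score.get)
    path = [last]
    for ptr in reversed(backptrs):    # follow back-pointers to recover path
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Invented two-tag example: "O" vs a hypothetical risk-factor mention tag.
TAGS = ["O", "B-RISK"]
TRANS = {("O", "O"): 0.0, ("O", "B-RISK"): 0.0,
         ("B-RISK", "O"): 0.0, ("B-RISK", "B-RISK"): -1.0}
EMISSIONS = [{"O": 1.0, "B-RISK": 0.0}, {"O": 0.0, "B-RISK": 2.0}]
```

The transition scores are what lets the CRF forbid or penalize invalid tag sequences, which per-token classification alone cannot do.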


Subject(s)
COVID-19 , Natural Language Processing , Humans , COVID-19/diagnosis , Neural Networks, Computer , Machine Learning , Electronic Health Records
7.
31st ACM Web Conference, WWW 2022 ; : 823-832, 2022.
Article in English | Scopus | ID: covidwho-2029541

ABSTRACT

Since the rise of the COVID-19 pandemic, peer-reviewed biomedical repositories have experienced a surge in chemical and disease related queries. These queries have a wide variety of naming conventions and nomenclatures, from trademark and generic names to chemical-composition mentions. Normalizing or disambiguating these mentions within texts provides researchers and data-curators with more relevant articles returned by their search query. Named entity normalization aims to automate this disambiguation process by linking entity mentions onto their appropriate candidate concepts within a biomedical knowledge base or ontology. We explore several term embedding aggregation techniques in addition to how the term's context affects evaluation performance. We also evaluate our embedding approaches for normalizing term instances containing one or many relations within unstructured texts. © 2022 Owner/Author.
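One aggregation technique of the kind the abstract mentions, averaging term embeddings before nearest-neighbour linking, can be sketched as follows (the vectors and vocabulary are toy values, not a real knowledge base or the paper's embeddings):

```python
import math

# Toy 2-d static embeddings; a real normalizer would use learned vectors.
VEC = {
    "acetaminophen": [0.9, 0.1], "paracetamol": [0.85, 0.15],
    "tylenol": [0.8, 0.2], "fever": [0.1, 0.9],
}

def mention_vector(tokens):
    """Aggregate token embeddings by averaging (one of several options)."""
    dims = len(next(iter(VEC.values())))
    vecs = [VEC[t] for t in tokens if t in VEC]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def normalize(mention_tokens, concepts):
    """Link a mention to the nearest concept name in the knowledge base."""
    m = mention_vector(mention_tokens)
    return max(concepts, key=lambda c: cosine(m, VEC[c]))
```

Averaging is only one option; the paper also examines other aggregation schemes and how surrounding context changes the results.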

8.
31st ACM Web Conference, WWW 2022 ; : 740-750, 2022.
Article in English | Scopus | ID: covidwho-2029538

ABSTRACT

Semantic text annotations have been a key factor for supporting computer applications ranging from knowledge graph construction to biomedical question answering. In this systematic review, we provide an analysis of the data models that have been applied to semantic annotation projects for the scholarly publications available in the CORD-19 dataset, an open database of the full texts of scholarly publications about COVID-19. Based on Google Scholar and the screening of specific research venues, we retrieve seventeen publications on the topic, mostly from the United States of America. Subsequently, we outline and explain the inline semantic annotation models currently applied to the full texts of biomedical scholarly publications. Then, we discuss the data models currently used in semantic annotation projects on the CORD-19 dataset to provide interesting directions for the development of semantic annotation models and projects. © 2022 ACM.

9.
2021 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2021 ; : 3963-3970, 2021.
Article in English | Scopus | ID: covidwho-1722891

ABSTRACT

Biomedical named entity recognition from clinical texts is a fundamental task for clinical data analysis, given the large volume of electronic medical record data, mostly in free-text format, available in real-world clinical settings. Clinical text data incorporate significant phenotypic medical entities, which could be used for profiling the clinical characteristics of patients in specific disease conditions. However, general approaches mostly rely on coarse-grained annotations (e.g. mentions of symptom terms) of phenotypic entities in benchmark text datasets. Owing to the numerous negation expressions of phenotypic entities (e.g. 'no fever', 'no cough' and 'no hypertension') in clinical texts, this cannot feed the subsequent data analysis process with well-prepared structured clinical data. Thus, we constructed a fine-grained Chinese clinical corpus. Thereafter, we proposed a phenotypic named entity recognizer (Phenonizer). The results on the test set show that Phenonizer outperforms methods based on Word2Vec, with an F1-score of 0.896. By comparing character embeddings from different data, we found that character embeddings trained on clinical corpora can improve the F1-score by 0.0103. Furthermore, the fine-grained dataset enables methods to distinguish between negated and presented symptoms, avoiding the interference of negated symptoms. Finally, we tested the generalization performance of Phenonizer, achieving a superior F1-score of 0.8389. In summary, together with the fine-grained annotated benchmark dataset, Phenonizer offers a feasible approach to effectively extract symptom information from Chinese clinical texts with acceptable performance. © 2021 IEEE.
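Distinguishing negated from presented symptoms can be sketched with a simple cue-window rule (a toy stand-in using English examples; the paper's recognizer learns this distinction from fine-grained Chinese annotations):

```python
import re

NEGATION_CUES = {"no", "denies", "without"}

def split_symptoms(text, symptoms):
    """Separate negated from presented symptom mentions with a
    two-token negation window (illustrative rule, not the paper's model)."""
    tokens = re.findall(r"\w+", text.lower())
    negated, present = [], []
    for i, tok in enumerate(tokens):
        if tok in symptoms:
            # Negated if a cue appears within the two preceding tokens.
            window = tokens[max(0, i - 2):i]
            (negated if NEGATION_CUES & set(window) else present).append(tok)
    return negated, present
```

Keeping the two groups separate is what prevents 'no fever' from being counted as a presented symptom in downstream patient profiling.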
